“Forecasting” is always the most increasing topic in the house-price market and has gained importance and popularity among the general public and companies of all sizes due to its economic benefits and low risk. At the same time, the real estate industry accounts for a large part of GPA in the United States, which can directly influence citizens’ consumption level as well as the overall domestic economy[1]. Zillow is one of the most popular tech real estate companies founded in 2006. They have realized that its housing market predictions are not so accurate as the estimation is very complex as it involves a wide variety of attributes, like crime rate, park number, land use type and so on. Accurate house price forecasts are not only beneficial to investors and consumers, but are also the basis for stable economic development.
So as for this project, our aim to bring geospatial analysis and machine learning method together to generate our model. Through Ordinary Least Squares (OLS) linear regression to estimate the housing price in Mecklenberg County, NC.
The project has encountered difficulties and has some limitations. As we all know, house prices are an integrated and diverse result related to the internal characteristics of the house itself, amenities, public features, the spatial structure of the city, and even to the development experience the United States. Many geospatial factors cannot be quantified,which will lead to inaccuracy. The acquisition and pre-processing of data from different sources and formats is also a challenge, and possible errors in the data will propagate and accumulate. What’s more, the OLS model is not so appropriate for house price prediction.
Our main strategy is to use many factors that may have an impact on house prices, combine with OLS to train and test our model. The steps are divided into four parts: Data Wrangling; Exploratory Analysis; Feature Engineering and Selection; Model Estimation and Validation. The details are in the Section 3.
Our result model is based on 39 features. Within the 27.38% price prediction accuracy range (MAPE of 0.27), it explains about 69% of the variation in Mecklenberg house prices (adjusted R-squared of 0.69). The model’s Moran’s I shows that there is still some degree of spatial autocorrelation in the errors (Moran’s I of 0.19). We divided the study area into four groups. The results show that there is no significant difference in race, income, percentage of vacant and renter occupied, meaning that our model is conditional generalizable. To a certain extent, we can believe that our OLS model is possible to work as a reference for Zillow to predict house price.
knitr::opts_chunk$set(echo = TRUE)
options(scipen=999)
options(tigris_class = "sf")
# Load some libraries
library(factoextra)
library(FactoMineR)
library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrplot)
library(corrr) # another way to plot correlation plot
library(kableExtra)
library(jtools) # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr) # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(stargazer)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
census_api_key("155cc525674a0d27c98afbb7030d0802e9bb5543", overwrite = TRUE)
According to the hedonic model, house prices are affected by three constituent parts: 1)physical characteristics; 2)public services or amenities; 3)the spatial process of prices. Therefore, we gather data from those three aspects.
Physical characteristics are the intrinsic characteristics of houses, like the number of bedrooms of the house or the heated area. We choose 8 physical characteristics in total like heated area, full bathrooms, bldggrade, half bathrooms, bedrooms, house size, Area, house age as features to predict the price, all of them are from studentData.geojson on the github.
Public services or amenities are like crimes or schools, we choose 19 characteristics of this part, and percentage of people with bachelor’s degree or more, percentage of poverty, percentage of white residents, vacant lands, population density, commercial construction, job density, financial services, impervious surface, house density, single family, violent crime, public transportation, park, library, shopping center, schools, medical places are used to predict house price. Those data are from both ACS and Mecklenberg County’s Open Data portal.
We also create some new features from existing data by using KNN. The new feature are 5 nearest neighbors of parks, libraries, shopping centers, schools, and medical places.
Mecklenburg.sf <-
st_read("./data/studentData/studentData.geojson") %>%
dplyr::select(lot_num,totalac, municipali, price, yearbuilt,city, heatedarea, storyheigh, aheatingty, heatedfuel, actype, numfirepla, bldggrade, fullbaths, halfbaths, bedrooms, units, accounttyp, landusecod, shape_Area, toPredict, musaID) %>%
mutate(house_age = 2022 - yearbuilt,
pricepersq=price/shape_Area) %>%
st_transform('EPSG:32119') %>%
na.omit()
Mecklenburg.challenge <-
Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]
Mecklenburg.tracts20 <-
get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E",
"B19013_001E","B15001_009",
"B06012_002E", "B15001_050",
# vacant variables
"B25002_003E",
"B25004_002E","B25004_003E",
"B25004_004E","B25004_005E",
# total housing unit
"B25001_001E",
# renter occupied
'B08137_003E'),
year=2020, state= "NC", county= "Mecklenburg", geometry=T, output="wide") %>%
st_transform('EPSG:32119') %>%
rename(TotalPop = B25026_001E,
Whites = B02001_002E,
FemaleBachelors = B15001_050E,
MaleBachelors = B15001_009E,
MedHHInc = B19013_001E,
TotalPoverty = B06012_002E,
TotalVacant = B25002_003E,
ForRent = B25004_002E,
ForRentVac = B25004_003E,
ForSale = B25004_004E,
ForSaleVac = B25004_005E,
TotalUnit = B25001_001E,
RenterOccupied = B08137_003E) %>%
dplyr::select(-NAME, -starts_with("B")) %>%
mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop, 0),
pctBachelors = ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop), 0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
TotalOccupied = TotalUnit - TotalVacant,
pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
dplyr::select(-Whites, -FemaleBachelors, -MaleBachelors, -TotalPoverty)
Mecklenburg.neighborhood <-
st_read("./data/Quality of Life Explorer/Character/Area.geojson") %>%
st_transform('EPSG:32119')
## Join studentData with ACS data
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.tracts20)
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.neighborhood)
## External Data (Quality of Life)
## 1. Character
Mecklenburg.vacant_land <-
st_read("./data/Quality of Life Explorer/Character/Vacant Land.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2021.raw) %>%
rename(vacant_land = X2021.raw) %>%
na.omit()
Mecklenburg.race_AF <-
st_read("./data/Quality of Life Explorer/Character/Race_Ethnicity - Black or African American.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(race_AF = X2020) %>%
na.omit()
Mecklenburg.age_Res <-
st_read("./data/Quality of Life Explorer/Character/Age of Residents.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(age_Res = X2020) %>%
na.omit()
Mecklenburg.older_pop <-
st_read("./data/Quality of Life Explorer/Character/Population - Older Adult.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(older_pop = X2020) %>%
na.omit()
Mecklenburg.pop_density <-
st_read("./data/Quality of Life Explorer/Character/Population Density.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020.raw) %>%
rename(pop_density = X2020.raw) %>%
na.omit()
## 2. Economy
Mecklenburg.employment <-
st_read("./data/Quality of Life Explorer/Economy/Employment.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(employment = X2020) %>%
na.omit()
Mecklenburg.commercial_construction <-
st_read("./data/Quality of Life Explorer/Economy/Commercial Construction.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020.raw) %>%
rename(commercial_construction = X2020.raw) %>%
na.omit()
Mecklenburg.job_density <-
st_read("./data/Quality of Life Explorer/Economy/Job Density.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2019.raw) %>%
rename(job_density = X2019.raw) %>%
na.omit()
Mecklenburg.financial_services <-
st_read("./data/Quality of Life Explorer/Economy/Proximity to Financial Services.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020.raw) %>%
rename(financial_services = X2020.raw) %>%
na.omit()
Mecklenburg.food_service <-
st_read("./data/Quality of Life Explorer/Economy/Food and Nutrition Services.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(food_service = X2020)%>%
na.omit()
## 3. Education
Mecklenburg.early_edu <-
st_read("./data/Quality of Life Explorer/Education/Proximity to Early Care and Education.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2020.raw) %>%
rename(early_edu = X2020.raw) %>%
na.omit()
## 4. Environment
Mecklenburg.drive_alone <-
st_read("./data/Quality of Life Explorer/Environment/Commuters Driving Alone.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2020.raw) %>%
rename(drive_alone = X2020.raw) %>%
na.omit()
Mecklenburg.impervious_surface <-
st_read("./data/Quality of Life Explorer/Environment/Impervious Surface.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2021.raw) %>%
rename(impervious_surface = X2021.raw) %>%
na.omit()
Mecklenburg.gas_consumption <-
st_read("./data/Quality of Life Explorer/Environment/Energy Consumption - Natural Gas.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2013.raw) %>%
rename(gas_consumption = X2013.raw) %>%
na.omit()
Mecklenburg.tree_res <-
st_read("./data/Quality of Life Explorer/Environment/Tree Canopy - Residential.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2012.raw) %>%
rename(tree_res = X2012.raw) %>%
na.omit()
## 5. Health
Mecklenburg.recreation <-
st_read("./data/Quality of Life Explorer/Health/Proximity to Public Outdoor Recreation.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2020.raw) %>%
rename(recreation = X2020.raw) %>%
na.omit()
Mecklenburg.age_death <-
st_read("./data/Quality of Life Explorer/Health/Age of Death.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2019) %>%
rename(age_death = X2019) %>%
na.omit()
## 6. Housing
Mecklenburg.house_size <-
st_read("./data/Quality of Life Explorer/Housing/Housing Size.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(house_size = X2020)%>%
na.omit()
Mecklenburg.house_density <-
st_read("./data/Quality of Life Explorer/Housing/Housing Density.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020.raw) %>%
rename(house_density = X2020.raw)%>%
na.omit()
Mecklenburg.residential_occupancy <-
st_read("./data/Quality of Life Explorer/Housing/Residential Occupancy.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020) %>%
rename(residential_occupancy = X2020)%>%
na.omit()
Mecklenburg.single_family <-
st_read("./data/Quality of Life Explorer/Housing/Single-Family Housing.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(X2020.raw) %>%
rename(single_family = X2020.raw)%>%
na.omit()
## 7. Safety
Mecklenburg.violent_crime <-
st_read("./data/Quality of Life Explorer/Safety/Crime - Violent.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2020.raw) %>%
rename(violent_crime = X2020.raw)%>%
na.omit()
## 8. Transportation
Mecklenburg.public_transportation <-
st_read("./data/Quality of Life Explorer/Transportation/Proximity to Public Transportation.geojson") %>%
st_transform('EPSG:32119')%>%
dplyr::select(X2020.raw) %>%
rename(public_transportation = X2020.raw)
## Points Data
Mecklenburg.park <-
st_read("./data/parks.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(geometry)
Mecklenburg.library <-
st_read("./data/Libraries.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(geometry)
Mecklenburg.medical <-
st_read("./data/Medical_Facilities.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(geometry)
Mecklenburg.shopping <-
st_read("./data/Existing_Shopping_Centers.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(geometry)
Mecklenburg.schools <-
st_read("./data/Schools.geojson") %>%
st_transform('EPSG:32119') %>%
dplyr::select(geometry)
## Create Nearest Neighbor Feature
## 1. Character
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.vacant_land) %>%
na.omit()
# Mecklenburg.sf <-
# st_join(Mecklenburg.sf, Mecklenburg.older_pop)%>%
# na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.pop_density)%>%
na.omit()
# Mecklenburg.sf <-
# st_join(Mecklenburg.sf, Mecklenburg.age_Res)%>%
# na.omit()
Mecklenburg.challenge <-
Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]
## 2. Economy
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.employment) %>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.commercial_construction)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.job_density)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.financial_services)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.food_service)%>%
na.omit()
## 3. Education
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.early_edu)%>%
na.omit()
## 4. Environment
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.impervious_surface)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.drive_alone)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.gas_consumption)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.tree_res)%>%
na.omit()
## 5. Health
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.recreation)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.age_death)%>%
na.omit()
## 6. Housing
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.house_size)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.house_density)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.single_family)%>%
na.omit()
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.residential_occupancy)%>%
na.omit()
## 7. Safety
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.violent_crime)%>%
na.omit()
## 8. Transportation
Mecklenburg.sf <-
st_join(Mecklenburg.sf, Mecklenburg.public_transportation)%>%
na.omit()
## Points Data
## Park
Mecklenburg.sf <-
Mecklenburg.sf %>%
mutate(
park_nn1 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 1),
park_nn2 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 2),
park_nn3 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 3),
park_nn4 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 4),
park_nn5 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.park)), 5))%>%
na.omit()
## Library
Mecklenburg.sf <-
Mecklenburg.sf %>%
mutate(
library_nn1 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 1),
library_nn2 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 2),
library_nn3 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 3),
library_nn4 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 4),
library_nn5 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.library)), 5))%>%
na.omit()
## Shopping center
Mecklenburg.sf <-
Mecklenburg.sf %>%
mutate(
shopping_nn1 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 1),
shopping_nn2 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 2),
shopping_nn3 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 3),
shopping_nn4 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 4),
shopping_nn5 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.shopping)), 5))%>%
na.omit()
## Schools
Mecklenburg.sf <-
Mecklenburg.sf %>%
mutate(
school_nn1 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 1),
school_nn2 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 2),
school_nn3 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 3),
school_nn4 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 4),
school_nn5 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.schools)), 5))%>%
na.omit()
## Crime
# Mecklenburg.sf <-
# Mecklenburg.sf %>%
# mutate(
# crime_nn1 =
# nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 1),
# crime_nn2 =
# nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 2),
# crime_nn3 =
# nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 3),
# crime_nn4 =
# nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 4),
# crime_nn5 =
# nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.crime)), 5))
## Medical
Mecklenburg.sf <-
Mecklenburg.sf %>%
mutate(
medical_nn1 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 1),
medical_nn2 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 2),
medical_nn3 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 3),
medical_nn4 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 4),
medical_nn5 =
nn_function(st_coordinates(st_centroid(Mecklenburg.sf)), st_coordinates(st_centroid(Mecklenburg.medical)), 5))%>%
na.omit()
This table statistics some features of independent variables.
Mecklenburg.select_data <-
Mecklenburg.sf %>%
select(price,heatedarea, fullbaths, bldggrade, halfbaths, bedrooms, shape_Area, house_age, MedHHInc, TotalVacant, TotalUnit, pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction, job_density, financial_services, impervious_surface, house_size,house_density, single_family, violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5)%>%
st_drop_geometry()
internal.feature.list <- Mecklenburg.select_data %>%
dplyr:: select(heatedarea, fullbaths, bldggrade, halfbaths, bedrooms, shape_Area, house_age, house_size,house_density)
stargazer(internal.feature.list, type = "html",
title = "Table 2. Summary Statistics of Internal Characteristics",
header = FALSE,
single.row = TRUE)
| Statistic | N | Mean | St. Dev. | Min | Max |
| heatedarea | 45,341 | 2,362.307 | 1,061.263 | 0.000 | 14,710.000 |
| fullbaths | 45,341 | 2.283 | 0.827 | 0 | 9 |
| halfbaths | 45,341 | 0.595 | 0.528 | 0 | 4 |
| bedrooms | 45,341 | 3.510 | 0.944 | 0 | 65 |
| shape_Area | 45,341 | 15,933.740 | 35,219.760 | 1,139.637 | 3,486,865.000 |
| house_age | 45,341 | 28.652 | 24.051 | 0 | 194 |
| house_size | 45,341 | 2,281.600 | 730.491 | 1,005.000 | 5,494.000 |
| house_density | 45,341 | 1,308.028 | 910.955 | 69.100 | 11,163.000 |
piblic.feature.list <- Mecklenburg.select_data %>%
dplyr:: select(financial_services, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5)
stargazer(piblic.feature.list, type = "html",
title = "Table 2. Summary Statistics of Public Services Characteristics",
header = FALSE,
single.row = TRUE)
| Statistic | N | Mean | St. Dev. | Min | Max |
| financial_services | 45,341 | 363.688 | 786.328 | 0 | 9,462 |
| public_transportation | 45,341 | 703.682 | 942.436 | 0 | 11,641 |
| park_nn2 | 45,341 | 1,513.375 | 790.582 | 63.144 | 5,064.762 |
| park_nn3 | 45,341 | 1,738.650 | 819.041 | 90.029 | 5,653.027 |
| park_nn5 | 45,341 | 2,147.716 | 880.371 | 184.989 | 6,754.394 |
| library_nn1 | 45,341 | 4,026.632 | 2,238.726 | 73.891 | 10,454.780 |
| library_nn2 | 45,341 | 5,452.666 | 2,569.007 | 333.465 | 13,914.410 |
| library_nn3 | 45,341 | 6,379.752 | 2,914.899 | 635.328 | 16,460.460 |
| library_nn4 | 45,341 | 7,066.235 | 3,112.351 | 886.952 | 17,780.290 |
| library_nn5 | 45,341 | 7,833.089 | 3,229.467 | 1,017.134 | 18,692.330 |
| shopping_nn1 | 45,341 | 1,774.212 | 1,306.096 | 43.122 | 9,488.096 |
| shopping_nn2 | 45,341 | 2,023.224 | 1,314.477 | 113.299 | 9,689.564 |
| shopping_nn3 | 45,341 | 2,232.628 | 1,318.495 | 147.817 | 9,789.222 |
| school_nn2 | 45,341 | 1,534.650 | 857.392 | 97.415 | 6,752.533 |
| school_nn3 | 45,341 | 1,803.053 | 939.254 | 249.679 | 7,589.063 |
| medical_nn3 | 45,341 | 2,046.510 | 1,365.741 | 45.589 | 8,104.070 |
| medical_nn4 | 45,341 | 2,164.763 | 1,388.329 | 48.756 | 8,406.003 |
| medical_nn5 | 45,341 | 2,269.014 | 1,408.529 | 52.703 | 8,627.653 |
spatial.feature.list <- Mecklenburg.select_data %>%
dplyr:: select(MedHHInc, TotalVacant, TotalUnit, pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction, job_density, impervious_surface, single_family,violent_crime)
stargazer(spatial.feature.list, type = "html",
title = "Table 2. Summary Statistics of Spatial Structure Characteristics",
header = FALSE,
single.row = TRUE)
| Statistic | N | Mean | St. Dev. | Min | Max |
| MedHHInc | 45,341 | 86,346.610 | 37,587.750 | 17,685 | 238,750 |
| TotalVacant | 45,341 | 95.336 | 75.152 | 0 | 405 |
| TotalUnit | 45,341 | 1,606.760 | 464.513 | 249 | 2,778 |
| pctWhite | 45,341 | 0.571 | 0.272 | 0.011 | 1.710 |
| pctPoverty | 45,341 | 0.090 | 0.082 | 0.000 | 0.614 |
| vacant_land | 45,341 | 243.841 | 393.899 | 0.000 | 1,755.900 |
| pop_density | 45,341 | 3,373.346 | 2,032.687 | 485.000 | 16,580.900 |
| commercial_construction | 45,341 | 8.667 | 18.127 | 0 | 485 |
| job_density | 45,341 | 1,322.237 | 3,911.261 | 4.000 | 96,789.000 |
| impervious_surface | 45,341 | 147.362 | 104.904 | 19.000 | 706.700 |
| single_family | 45,341 | 953.650 | 585.037 | 2 | 3,649 |
| violent_crime | 45,341 | 9.730 | 15.200 | 0 | 198 |
This table reflects the correlation between each variables. It can help us to find variables which have strong correlations between each other. In that case, we can reduce redundant features.
numericVars <-
select_if(st_drop_geometry(Mecklenburg.select_data), is.numeric) %>% na.omit()
M <- cor(numericVars,use="pairwise.complete.obs")
corrplot(M, method = "square",type = "lower",tl.cex = 0.5)
The first two plots shows the correlations between price and house intrinsic characteristics, which shows strong correlations. It means the intrinsic characteristics of houses are important for price prediction. To our surprise, the school feature doesn’t show a strong correlation to the price which is somewhat against common sense. For the plot of pctWhite which represents the correlation of percentage of white people and price, it reflects some outliers. This plot can help us to find these outliers and remove them.
st_drop_geometry(Mecklenburg.sf) %>%
dplyr::select(price, heatedarea,pctWhite,school_nn2,house_size) %>%
filter(price <= 10000000) %>%
gather(Variable, Value,-price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables") +
plotTheme()
This map shows the spatial difference of house sale price.
new_palette5 = c("#EFFFFD", "#B8FFF9", "#85F4FF", "#42C2FF", "#03045E")
Mecklenburg.modelling <-
Mecklenburg.sf[Mecklenburg.sf$toPredict == "MODELLING",] %>%
filter(price < 10000000) %>%
filter(price > 0) %>%
na.omit()
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.sf, aes(color = q5(price)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.modelling,'price'),
name = "Price, $") +
labs(title = "Map of House Price in Mecklenburg\n") +
mapTheme() +
plotTheme()
Similar to the map of house price, these three maps reflects spatial distributions of heated area, house area, and violent crime. From the plot we can find, the most southern and northern parts has better heat conditions, larger house area, and less violent crimes. The middle area is diametrically opposed. The plot reflects spatial difference of the features which is helpful for us to understand the characteristics of each region and choose more useful features for the model.
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.sf, aes(color = q5(heatedarea)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.sf,'heatedarea'),
name = "Heated Area") +
labs(title = "Map of Heated Area in Mecklenburg\n") +
mapTheme() +
plotTheme()
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.sf, aes(color = q5(shape_Area)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.sf,'shape_Area'),
name = "Heated Area") +
labs(title = "Map of House Area in Mecklenburg\n") +
mapTheme() +
plotTheme()
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.sf, aes(color = q5(violent_crime)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.sf,'violent_crime'),
name = "Heated Area") +
labs(title = "Map of violent crime in Mecklenburg\n") +
mapTheme() +
plotTheme()
The first plot shows each feature’s contribution to the PCA, which can reflect which feature is more important in some extent.
The rest plots are heated maps to reflect density of certain feature’s spatial distribution.
Mecklenburg.numeric <- Mecklenburg.sf[, unlist(lapply(Mecklenburg.sf,
is.numeric))]
Mecklenburg.numeric_nogeo<- Mecklenburg.numeric %>% st_drop_geometry()
Mecklenburg.pca <- prcomp(Mecklenburg.numeric_nogeo, center = TRUE,scale. = TRUE)
Mecklenburg.pca_plot2<-fviz_contrib(Mecklenburg.pca, choice = "var", axes = 1:2)
Mecklenburg.pca_plot2 +
theme(axis.text = element_text(size = 10),
axis.text.x = element_text(size = 9, angle = 55))
ggplot() +
geom_sf(data = st_union(Mecklenburg.tracts20)) +
stat_density2d(data = data.frame(st_coordinates(Mecklenburg.public_transportation)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.002, bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "public_transportation") +
scale_alpha(range = c(0.00, 0.2), guide = FALSE) +
labs(title = "Density of recreations, Mecklenburg") +
mapTheme()
ggplot() + geom_sf(data = st_union(Mecklenburg.tracts20)) +
stat_density2d(data = data.frame(st_coordinates(Mecklenburg.shopping)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.002, bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "shopping") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of schools, Mecklenburg") +
mapTheme()
ggplot() + geom_sf(data = st_union(Mecklenburg.tracts20)) +
stat_density2d(data = data.frame(st_coordinates(Mecklenburg.job_density)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.002, bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#c4fff9", high = "#07beb8", name = "Job") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of jobs, Mecklenburg") +
mapTheme()
For this project, we aim to train a model to predict the house price as accurately as possible and we will combine geospatial analytics and machine learning method (OLS) together to generate our model. The steps are divided into four sections:
Data wrangling is the first step to clean all the data we collected. We import all the data and set them to the same coordinate and format so that subsequent operations can be performed. In addition, we need to remove some abnormal data like NA. Then, merge all the data to one dataset according to the geographic location relationship and separate them into two parts, training data and testing data, normally it’s 80%:20%. So we complete the preparation.
Exploratory analysis helps us get a better a glimpse of data. It is a good approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
The feature engineering and selection is done in the following steps:
In order to choose the most effective features, we have put all the features into the OLS model to test the fitness firstly, then summarized the model’s performance, the output gives the p-value and the coefficient for each variable. P-value is the probability, which reflects the likelihood of an event occurring. In other words, it reflects the probability of having the coefficient assuming there is no correlation between the independent and the dependent variables. In general, the p-value should be smaller than 0.05. R-squared is also calcutated, which is a goodness-of-fit measure for linear regression models. It measures the strength of the relationship between your linear model and the dependent variables on a 0 - 100% scale, which means how close the data are to the fitted regression line. In general, the higher the R-squared, the better the model fits your data.
We checked each p-value of the variables, which tell us what variables to keep or remove. After processing all the data, we ran OLS model again to compare the R-squared value. We have done multiple trials to select the best set of predictors.
After the final model is obtained, We have evaluated the model’s performance by R-Squared, MAE, MAPE and also conducted cross-validation. Moran’s I is calculated to assess the spatial autocorrelation of the residuals and to assess the generalizability of the model using race, income, income, percentage of vacant and renter occupied data. Zillow manager can visualize the spatial autocorrelation results of the model through Moran’s I and assess the generalizability based on tables of different background data.
First of all, We separated the Modelling data and the Challenge data according to “toPredict” field, then divided the Modelling Data set into training data and test data, the ratio is 8:2.
set.seed(56235)
Mecklenburg.sf <-
Mecklenburg.sf %>%
na.omit()
Mecklenburg.modelling <-
Mecklenburg.sf[Mecklenburg.sf$toPredict == "MODELLING",] %>%
filter(price < 10000000) %>%
filter(price > 0)
Train <-
createDataPartition(y = paste(Mecklenburg.modelling$lot_num), p=0.8, list=F)
Mecklenburg.train <-
Mecklenburg.sf[Train,] %>%
filter(price < 10000000) %>%
filter(price > 0) %>%
na.omit()
Mecklenburg.test <-
Mecklenburg.sf[-Train,] %>%
filter(price < 10000000) %>%
filter(price > 0)%>%
na.omit()
Mecklenburg.challenge <-
Mecklenburg.sf[Mecklenburg.sf$toPredict == "CHALLENGE",]
Model1 <- lm(price ~ ., data = as.data.frame(Mecklenburg.train) %>%
dplyr::select(price,heatedarea, fullbaths, bldggrade, halfbaths, bedrooms, shape_Area, house_age, MedHHInc, TotalVacant, TotalUnit, pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction,job_density, financial_services, impervious_surface, house_size,house_density, single_family, violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5))
require(broom)
tidy(Model1)%>% kable() %>% kable_styling("striped", full_width = F)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -589.875784 | 11286.3099636 | -0.0522647 | 0.9583181 |
| heatedarea | 117.742249 | 2.1799902 | 54.0104485 | 0.0000000 |
| fullbaths | 30518.304270 | 2398.0293332 | 12.7264099 | 0.0000000 |
| bldggradeCUSTOM | 1422819.145248 | 17399.0560938 | 81.7756514 | 0.0000000 |
| bldggradeEXCELLENT | 621760.392032 | 9500.8813240 | 65.4423912 | 0.0000000 |
| bldggradeFAIR | -27185.063311 | 9226.1121763 | -2.9465351 | 0.0032156 |
| bldggradeGOOD | 56212.575404 | 3140.2232686 | 17.9008212 | 0.0000000 |
| bldggradeMINIMUM | -37023.688393 | 45443.6613239 | -0.8147162 | 0.4152401 |
| bldggradeVERY GOOD | 207247.631392 | 5150.8209981 | 40.2358442 | 0.0000000 |
| halfbaths | 9786.691299 | 2434.8363058 | 4.0194453 | 0.0000585 |
| bedrooms | -14895.132035 | 1433.0037424 | -10.3943427 | 0.0000000 |
| shape_Area | 1.354752 | 0.0351807 | 38.5083579 | 0.0000000 |
| house_age | 287.054390 | 64.0170322 | 4.4840315 | 0.0000073 |
| MedHHInc | 1.027735 | 0.0585072 | 17.5659519 | 0.0000000 |
| TotalVacant | 286.490631 | 18.6141373 | 15.3910238 | 0.0000000 |
| TotalUnit | -15.295481 | 2.8639178 | -5.3407544 | 0.0000001 |
| pctWhite | 37461.480319 | 6808.2532560 | 5.5023629 | 0.0000000 |
| pctPoverty | 151705.386239 | 19279.3279406 | 7.8688109 | 0.0000000 |
| vacant_land | -57.666582 | 4.9446296 | -11.6624675 | 0.0000000 |
| pop_density | -20.600938 | 2.9485742 | -6.9867458 | 0.0000000 |
| commercial_construction | 2256.721799 | 157.5523662 | 14.3236300 | 0.0000000 |
| job_density | -10.955546 | 0.7722128 | -14.1872108 | 0.0000000 |
| financial_services | 22.513701 | 3.6194575 | 6.2201867 | 0.0000000 |
| impervious_surface | 196.890334 | 19.0317241 | 10.3453756 | 0.0000000 |
| house_size | 40.008371 | 3.2892727 | 12.1632878 | 0.0000000 |
| house_density | 8.273438 | 7.1223532 | 1.1616158 | 0.2453992 |
| single_family | 32.016694 | 6.5905870 | 4.8579427 | 0.0000012 |
| violent_crime | -971.384101 | 109.3081394 | -8.8866585 | 0.0000000 |
| public_transportation | 16.396322 | 3.2617575 | 5.0268366 | 0.0000005 |
| park_nn2 | 31.229070 | 7.4517298 | 4.1908484 | 0.0000279 |
| park_nn3 | -21.767661 | 10.9520610 | -1.9875401 | 0.0468701 |
| park_nn5 | -4.459184 | 5.6912329 | -0.7835181 | 0.4333281 |
| library_nn1 | 4.057601 | 1.3288621 | 3.0534401 | 0.0022640 |
| library_nn2 | -24.546828 | 4.7918202 | -5.1226522 | 0.0000003 |
| library_nn3 | 86.134111 | 8.8259271 | 9.7592139 | 0.0000000 |
| library_nn4 | -53.623143 | 6.5495622 | -8.1872865 | 0.0000000 |
| library_nn5 | -25.952219 | 2.3449607 | -11.0672296 | 0.0000000 |
| shopping_nn1 | -12.190330 | 4.4562668 | -2.7355477 | 0.0062307 |
| shopping_nn2 | 35.698107 | 10.1788744 | 3.5070780 | 0.0004536 |
| shopping_nn3 | -45.500832 | 7.7424419 | -5.8768063 | 0.0000000 |
| school_nn2 | -65.467935 | 6.2483614 | -10.4776167 | 0.0000000 |
| school_nn3 | 77.334244 | 6.4120949 | 12.0606831 | 0.0000000 |
| medical_nn3 | 145.049239 | 21.3493322 | 6.7940879 | 0.0000000 |
| medical_nn4 | -229.874447 | 43.5386823 | -5.2797750 | 0.0000001 |
| medical_nn5 | 100.560697 | 24.6402499 | 4.0811557 | 0.0000449 |
stargazer(Model1, type = "html",
title = "Table 4.1.1 Summary Statistics of Model 1 ",
header = FALSE,
single.row = TRUE, align = F, notes.align = "c")
| Dependent variable: | |
| price | |
| heatedarea | 117.742*** (2.180) |
| fullbaths | 30,518.300*** (2,398.029) |
| bldggradeCUSTOM | 1,422,819.000*** (17,399.060) |
| bldggradeEXCELLENT | 621,760.400*** (9,500.881) |
| bldggradeFAIR | -27,185.060*** (9,226.112) |
| bldggradeGOOD | 56,212.570*** (3,140.223) |
| bldggradeMINIMUM | -37,023.690 (45,443.660) |
| bldggradeVERY GOOD | 207,247.600*** (5,150.821) |
| halfbaths | 9,786.691*** (2,434.836) |
| bedrooms | -14,895.130*** (1,433.004) |
| shape_Area | 1.355*** (0.035) |
| house_age | 287.054*** (64.017) |
| MedHHInc | 1.028*** (0.059) |
| TotalVacant | 286.491*** (18.614) |
| TotalUnit | -15.295*** (2.864) |
| pctWhite | 37,461.480*** (6,808.253) |
| pctPoverty | 151,705.400*** (19,279.330) |
| vacant_land | -57.667*** (4.945) |
| pop_density | -20.601*** (2.949) |
| commercial_construction | 2,256.722*** (157.552) |
| job_density | -10.956*** (0.772) |
| financial_services | 22.514*** (3.619) |
| impervious_surface | 196.890*** (19.032) |
| house_size | 40.008*** (3.289) |
| house_density | 8.273 (7.122) |
| single_family | 32.017*** (6.591) |
| violent_crime | -971.384*** (109.308) |
| public_transportation | 16.396*** (3.262) |
| park_nn2 | 31.229*** (7.452) |
| park_nn3 | -21.768** (10.952) |
| park_nn5 | -4.459 (5.691) |
| library_nn1 | 4.058*** (1.329) |
| library_nn2 | -24.547*** (4.792) |
| library_nn3 | 86.134*** (8.826) |
| library_nn4 | -53.623*** (6.550) |
| library_nn5 | -25.952*** (2.345) |
| shopping_nn1 | -12.190*** (4.456) |
| shopping_nn2 | 35.698*** (10.179) |
| shopping_nn3 | -45.501*** (7.742) |
| school_nn2 | -65.468*** (6.248) |
| school_nn3 | 77.334*** (6.412) |
| medical_nn3 | 145.049*** (21.349) |
| medical_nn4 | -229.874*** (43.539) |
| medical_nn5 | 100.561*** (24.640) |
| Constant | -589.876 (11,286.310) |
| Observations | 36,153 |
| R2 | 0.698 |
| Adjusted R2 | 0.697 |
| Residual Std. Error | 202,421.100 (df = 36108) |
| F Statistic | 1,893.992*** (df = 44; 36108) |
| Note: | p<0.1; p<0.05; p<0.01 |
The summary of R-squared and adjusted R-squared for the training set are demonstrated in the following table. The current model reached 0.697 R-squared, indicating that 69.7% of the variance for of the dependent variable can be explained by this prediction.
Mecklenburg.model <-
Model1
broom::glance(Mecklenburg.model) %>%
kable(caption = 'Table 4.2 Summary of Model Evaluation Parameters') %>%
kable_styling("striped", full_width = F)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6976985 | 0.6973301 | 202421.1 | 1893.992 | 0 | 44 | -492997.5 | 986087.1 | 986477.9 | 1479500623083658 | 36108 | 36153 |
Musa.result <-
Mecklenburg.challenge %>%
mutate(prediction = predict(Mecklenburg.model, Mecklenburg.challenge)) %>%
select(musaID, prediction) %>%
st_drop_geometry()
write.csv(Musa.result,"musa_prediction.csv", row.names = FALSE)
For the test set which occupied 20% of the Modelling set, it shows that this model is not so accurate. The R-squared is about 70%, MAE is 104032.5, MAPE is 0.27.
Mecklenburg.train <-
Mecklenburg.train %>%
mutate(Price.Predict = predict(Mecklenburg.model, Mecklenburg.train),
Price.Error = price - Price.Predict,
Price.AbsError = abs(price - Price.Predict),
Price.APE = (abs(price - Price.Predict))/price)
Mecklenburg.test <-
Mecklenburg.test %>%
mutate(Regression = "Baseline Regression",
Price.Predict = predict(Mecklenburg.model, Mecklenburg.test),
Price.Error = price - Price.Predict,
Price.AbsError = abs(price - Price.Predict),
Price.APE = (abs(price - Price.Predict))/price)
Mecklenburg.test$Price.APE[is.infinite(Mecklenburg.test$Price.APE)] <- NA
Mecklenburg.test.rsq <-
cor(Mecklenburg.test$price, Mecklenburg.test$Price.Predict, method = "pearson", use = "complete.obs") ^ 2
Mecklenburg.test.mae <-
mean(Mecklenburg.test$Price.AbsError, na.rm = T)
Mecklenburg.test.mape <-
mean(Mecklenburg.test$Price.APE, na.rm = T)
Mecklenburg.test.results<-as.data.frame(
cbind(Mecklenburg.test.rsq,
Mecklenburg.test.mae,
Mecklenburg.test.mape))
colnames(Mecklenburg.test.results)<-c("R Square", "Mean Absolute Errors (MAE)","MAPE")
kable(Mecklenburg.test.results,
caption = 'Table 4.3 R Square, MAE and MAPE for Test Set') %>%
kable_styling("striped", full_width = F)
| R Square | Mean Absolute Errors (MAE) | MAPE |
|---|---|---|
| 0.7012363 | 104032.5 | 0.2738515 |
as.data.frame(Mecklenburg.test) %>%
dplyr::select(price, Price.Predict) %>%
gather(Variable, Value) %>%
ggplot(aes(Value, fill = Variable)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("#FA7070", "#5F9DF7")) +
labs(title="Distribution of price & price predictions",
x = "Price/Prediction", y = "Density of observations") +
plotTheme()
This section conducted Cross-Validation to assess how our model would generalize to other independent data sets.
Below is a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.
The variation of 100 folds can also be visualized with a histogram of across-fold MAE. The MAE in this case is a little bit high, most of them are distributed in the range of 95,000-115,000.
fitControl <- trainControl(method = "cv",
number = 100,
savePredictions = TRUE)
set.seed(99526)
Mecklenburg.CV <-
train(price ~ ., data = as.data.frame(Mecklenburg.train)%>%
dplyr::select(price,heatedarea, fullbaths, bldggrade, halfbaths, bedrooms, shape_Area, house_age, MedHHInc, TotalVacant, TotalUnit, pctWhite, pctPoverty, vacant_land,pop_density, commercial_construction,job_density, financial_services, impervious_surface, house_size,house_density, single_family, violent_crime, public_transportation, park_nn2,park_nn3,park_nn5, library_nn1,library_nn2,library_nn3,library_nn4,library_nn5, shopping_nn1, shopping_nn2,shopping_nn3, school_nn2,school_nn3, medical_nn3,medical_nn4,medical_nn5),
method = "lm", trControl = fitControl, na.action = na.pass)
## show RMSE, MAE, R^2
kable(Mecklenburg.CV$resample,
caption = 'Table 4.4 Cross-validation Test: Summary of RMSE, R-Squared and MAE') %>%
kable_styling("striped", full_width = F)
| RMSE | Rsquared | MAE | Resample |
|---|---|---|---|
| 132257.3 | 0.8063728 | 84820.55 | Fold001 |
| 161176.3 | 0.8245584 | 91627.30 | Fold002 |
| 143440.5 | 0.7856015 | 98463.79 | Fold003 |
| 129314.0 | 0.8260462 | 88971.01 | Fold004 |
| 255219.4 | 0.7656121 | 120821.66 | Fold005 |
| 197422.8 | 0.6718350 | 103693.93 | Fold006 |
| 276923.7 | 0.6130969 | 111500.39 | Fold007 |
| 232732.6 | 0.6858177 | 116387.19 | Fold008 |
| 247954.3 | 0.7128319 | 115563.49 | Fold009 |
| 197185.3 | 0.8088958 | 103750.77 | Fold010 |
| 204485.9 | 0.6986820 | 107799.66 | Fold011 |
| 170709.9 | 0.8294433 | 98909.21 | Fold012 |
| 140775.5 | 0.7804158 | 97487.33 | Fold013 |
| 122332.7 | 0.8092515 | 90851.77 | Fold014 |
| 360176.5 | 0.4231420 | 125185.84 | Fold015 |
| 242608.7 | 0.5891597 | 113806.27 | Fold016 |
| 210585.3 | 0.7378107 | 120795.35 | Fold017 |
| 225945.1 | 0.5437389 | 107771.93 | Fold018 |
| 180223.2 | 0.6920633 | 113771.32 | Fold019 |
| 211713.5 | 0.8215357 | 107660.98 | Fold020 |
| 158765.8 | 0.7371448 | 96501.41 | Fold021 |
| 130025.8 | 0.8043685 | 90157.75 | Fold022 |
| 150926.9 | 0.7956205 | 91550.51 | Fold023 |
| 155582.8 | 0.7060099 | 99306.53 | Fold024 |
| 186902.5 | 0.6279338 | 94974.17 | Fold025 |
| 261224.4 | 0.7198698 | 116014.06 | Fold026 |
| 150088.2 | 0.8106802 | 99277.38 | Fold027 |
| 174647.5 | 0.8120858 | 110684.95 | Fold028 |
| 294016.2 | 0.5557784 | 109909.05 | Fold029 |
| 165277.8 | 0.7019629 | 106194.40 | Fold030 |
| 405842.6 | 0.2658657 | 123017.19 | Fold031 |
| 153707.2 | 0.7988618 | 100984.49 | Fold032 |
| 160859.7 | 0.7394718 | 106457.13 | Fold033 |
| 210592.8 | 0.6864077 | 107114.18 | Fold034 |
| 144080.0 | 0.7654732 | 92343.26 | Fold035 |
| 306485.3 | 0.7187494 | 112315.66 | Fold036 |
| 206330.2 | 0.5911698 | 106766.14 | Fold037 |
| 272401.3 | 0.6797514 | 111245.84 | Fold038 |
| 213143.4 | 0.6634821 | 107959.97 | Fold039 |
| 188989.3 | 0.6222590 | 102960.25 | Fold040 |
| 182165.7 | 0.8131559 | 102326.44 | Fold041 |
| 162554.0 | 0.7890582 | 99603.82 | Fold042 |
| 174860.4 | 0.7662719 | 110612.53 | Fold043 |
| 201125.9 | 0.6368694 | 107114.37 | Fold044 |
| 175263.7 | 0.8202691 | 106682.15 | Fold045 |
| 136964.7 | 0.8418540 | 89897.38 | Fold046 |
| 154284.3 | 0.6999854 | 96882.70 | Fold047 |
| 209183.1 | 0.5596055 | 102941.48 | Fold048 |
| 183263.8 | 0.8233870 | 108672.04 | Fold049 |
| 141719.2 | 0.7509117 | 90963.34 | Fold050 |
| 279560.3 | 0.7047242 | 113589.62 | Fold051 |
| 301293.0 | 0.5628124 | 127806.95 | Fold052 |
| 160618.5 | 0.8091244 | 95233.29 | Fold053 |
| 186601.0 | 0.7545888 | 107837.78 | Fold054 |
| 186333.6 | 0.6383631 | 106536.79 | Fold055 |
| 421772.6 | 0.3875774 | 123887.24 | Fold056 |
| 221857.1 | 0.5784830 | 102803.71 | Fold057 |
| 143340.0 | 0.7732882 | 94426.81 | Fold058 |
| 142584.0 | 0.6887276 | 88806.77 | Fold059 |
| 165833.7 | 0.7361380 | 95620.19 | Fold060 |
| 160261.6 | 0.8014695 | 104766.62 | Fold061 |
| 249866.6 | 0.6059430 | 110747.92 | Fold062 |
| 207968.0 | 0.8085016 | 109472.87 | Fold063 |
| 161370.0 | 0.7304463 | 100396.74 | Fold064 |
| 144151.5 | 0.8109968 | 97604.23 | Fold065 |
| 207927.4 | 0.7516596 | 107100.11 | Fold066 |
| 171202.7 | 0.6676488 | 100866.86 | Fold067 |
| 127927.8 | 0.7162603 | 90090.60 | Fold068 |
| 201288.7 | 0.6709075 | 102839.46 | Fold069 |
| 175163.2 | 0.7440159 | 98329.05 | Fold070 |
| 360557.1 | 0.3425005 | 113116.91 | Fold071 |
| 182137.0 | 0.6813015 | 101403.91 | Fold072 |
| 165871.2 | 0.7328143 | 98821.93 | Fold073 |
| 125076.4 | 0.7147360 | 92119.64 | Fold074 |
| 175945.2 | 0.7866469 | 96858.68 | Fold075 |
| 206174.7 | 0.7391501 | 116005.52 | Fold076 |
| 147059.0 | 0.7775983 | 98824.30 | Fold077 |
| 166975.2 | 0.7966021 | 95503.65 | Fold078 |
| 170504.3 | 0.6983789 | 96905.48 | Fold079 |
| 140757.6 | 0.7760661 | 90882.39 | Fold080 |
| 139492.8 | 0.7807296 | 93847.61 | Fold081 |
| 132556.4 | 0.8508215 | 91322.44 | Fold082 |
| 175608.0 | 0.8201767 | 101951.87 | Fold083 |
| 192816.9 | 0.8012079 | 95045.92 | Fold084 |
| 152448.5 | 0.7896580 | 90708.35 | Fold085 |
| 170354.3 | 0.7492709 | 103824.44 | Fold086 |
| 169118.5 | 0.7430909 | 101156.30 | Fold087 |
| 239328.8 | 0.7751722 | 107760.72 | Fold088 |
| 381995.9 | 0.4821003 | 110335.99 | Fold089 |
| 146055.7 | 0.8678784 | 100494.46 | Fold090 |
| 175482.8 | 0.6861038 | 98360.95 | Fold091 |
| 177243.9 | 0.7426306 | 100420.13 | Fold092 |
| 186142.0 | 0.7794631 | 109381.21 | Fold093 |
| 243245.1 | 0.6360583 | 108347.35 | Fold094 |
| 200240.9 | 0.6826500 | 111066.18 | Fold095 |
| 178125.8 | 0.7755755 | 104399.46 | Fold096 |
| 158678.7 | 0.8112389 | 93617.09 | Fold097 |
| 172738.6 | 0.7000035 | 105153.28 | Fold098 |
| 166553.7 | 0.7018630 | 100582.40 | Fold099 |
| 131337.1 | 0.8126508 | 87196.65 | Fold100 |
# plot MAE distribution
ggplot() +
geom_histogram(data = Mecklenburg.CV$resample, aes(MAE), bins = 50, fill = "orange") +
labs(title="Distribution of MAE",
subtitle = "k-fold cross validation; k = 100",
x="Mean Absolute Error", y="Count") +
plotTheme()
The Figure below demonstrated the predicted prices as a function of observed price, indicating that there are still some errors existing between the regression line (shown in blue) and the reference line (shown in orange). Both the training and test sets are in the same situation.
## Merge Predicted and Observed Price
Mecklenburg.train$Source <- "Training set"
Mecklenburg.test$Source <- "Test set"
Mecklenburg.alldata <-
rbind(Mecklenburg.train, Mecklenburg.test %>%
dplyr::select(-Regression))
ggplot(Mecklenburg.alldata, aes(x = price, y = Price.Predict, color = Source)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
geom_abline(color = "orange") +
labs(title = "Predicted Prices as a Function of Observed Prices",
subtitle = 'Blue line is model regression line\nwhile orange line is reference line\n',
caption = '',
x = "Observed Prices",
y = "Predicted Prices") +
theme() +
plotTheme()
## By sets
ggplot(Mecklenburg.alldata, aes(x = price, y = Price.Predict, color = Source)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
facet_wrap(~Source, ncol = 2) +
labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
subtitle = 'Blue line is model regression line while orange line is reference line\n',
caption = '',
x = "Observed Prices",
y = "Predicted Prices") +
theme() +
plotTheme()
This section used three different tools to examine if prediction errors display certain spatial patterns.
Mapping residuals: The maps below shows that most of the errors are clustering in the central and southern Mecklenburg.
The spatial lag in errors: A spatial lag is a variable that averages the neighboring values of a location. It shows that as house price errors increase, the nearby house price errors slightly increase.
Moran’s I: Moran’s I is a measure of the overall spatial autocorrelation. In general, when Moran’s I > 0, it indicates positive spatial correlation. The larger the value, the more obvious the spatial correlation. When Moran’s I < 0, it indicates negative spatial correlation, the smaller the value, the larger the spatial difference, otherwise, When Moran’s I = 0, the space is random.
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.test, aes(color = q5(Price.Error)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.test,'Price.Error'),
name = "Residuals") +
labs(title = "Map of residuals on test set\n") +
mapTheme() +
plotTheme()
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.test, aes(color = q5(Price.AbsError)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.test,'Price.AbsError'),
name = "Absolute Errors") +
labs(title = "Map of absolute errors on test set\n") +
mapTheme() +
plotTheme()
The plot below is the spatial lag in errors.
# Average Errors in Five Nearest Neighbors
Mecklenburg.test_predict <- Mecklenburg.test[which(Mecklenburg.test$Price.Error != 0),]
coords.test <- Mecklenburg.test_predict %>%
st_coordinates()
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Mecklenburg.test_predict$lagPriceError <- lag.listw(spatialWeights.test, Mecklenburg.test_predict$Price.Error)
ggplot(Mecklenburg.test_predict, aes(x = lagPriceError, y = Price.Error)) +
geom_point(color = "black") +
geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
caption = "",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
y = "Errors of Sale Price") +
plotTheme()
In our model, the Moran’s I is 0.19, which means our model is randomly distributed to some extend.
moranTest <- moran.mc(Mecklenburg.test_predict$Price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
Applying the model to the whole dataset, the distribution of price in different set is shown in the map below. Housing prices are significant higher in the central, southern area of Mecklenburg, and a few northern areas. There is no obvious difference between the pattern of the distribution of price predictions in the two datasets.
Mecklenburg.modelling$Price.Predict <- predict(Mecklenburg.model, newdata = Mecklenburg.modelling)
Mecklenburg.challenge$Price.Predict <- predict(Mecklenburg.model, Mecklenburg.challenge)
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.modelling, aes(color = q5(Price.Predict)), show.legend = "point",size = 0.8) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.modelling,'Price.Predict'),
name = "Price, $") +
labs(title = 'Map of Predicted Price in Modelling set\n',
subtitle = '',
caption = '') +
plotTheme() +
mapTheme()
ggplot() +
geom_sf(data=st_union(Mecklenburg.tracts20)) +
geom_sf(data = Mecklenburg.challenge, aes(color = q5(Price.Predict)), show.legend = "point",size = 2) +
scale_color_manual(values = alpha(new_palette5, .6),
labels = qBr(Mecklenburg.challenge,'Price.Predict'),
name = "Price, $") +
labs(title = 'Map of Predicted Price in Challenge set\n',
subtitle = '',
caption = '') +
plotTheme() +
mapTheme()
The map below is the mean absolute percentage error (MAPE) by neighborhood of the test set predictions.
Mecklenburg.nhood <- Mecklenburg.test %>%
group_by(id) %>%
summarize(meanPrice = mean(price, na.rm = T),
meanPrediction = mean(Price.Predict, na.rm = T),
meanMAE = mean(Price.AbsError, na.rm = T),
meanMAPE = mean(Price.APE, na.rm = T)) %>%
st_drop_geometry %>%
left_join(Mecklenburg.neighborhood,by = 'id') %>%
dplyr::select(-X2020)
# kable(Mecklenburg.nhood) %>%
# kable_styling("striped", full_width = F)
Mecklenburg.nhood <-
Mecklenburg.nhood %>%
st_sf()
warm_palette5 <- c("#FBDA91", "#FB9E91", "#FB8691", "#FB0091", "#D2001A")
ggplot() +
geom_sf(data = Mecklenburg.nhood, aes(fill = q5(meanMAPE))) +
scale_fill_manual(values = new_palette5,
labels = qBr(Mecklenburg.nhood, "meanMAPE", rnd = F),
name="Quintile\nBreaks") +
geom_sf(data = Mecklenburg.test, aes(color = q5(Price.Predict)),show.legend = "point", size = .5) +
scale_color_manual(values = alpha(warm_palette5, .5),
labels = qBr(Mecklenburg.test,'Price.Predict'),
name = "Price, $") +
labs(title = "Mean Test Set MAPE by Neighborhood") +
mapTheme()
The plot below is the MAPE by neighborhood as a function of mean price by neighborhood.
ggplot()+
geom_point(data = Mecklenburg.nhood, aes(x = meanPrice, y = meanMAPE), color = "lightblue") +
stat_smooth(data = Mecklenburg.nhood, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") +
labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
caption = '') +
xlab('Mean Price by Neighborhood') +
ylab('Mean MAPE by Neighborhood') +
plotTheme()
We have split the study area into four groups, according to race, income, vacant and renter occupied. The table results are as follows. Majority White and high income areas have more accurate prediction results. The maximum variation within the same group reached 5% in the race and renter group. The other two groups had little variation, which means our model is reliable in generalizability to some extend.
Mecklenburg.tracts20 <-
Mecklenburg.tracts20 %>%
mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
pctRenterContext = ifelse(pctRenterOccupied > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = raceContext)) +
scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Income Context") +
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = vacantContext)) +
scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="Vacant Context") +
labs(title = "Vacant Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(Mecklenburg.tracts20), aes(fill = pctRenterContext)) +
scale_fill_manual(values = c("#85F4FF", "#FB9E91"), name="pctRenter Context") +
labs(title = "pctRenter Context") +
mapTheme() + theme(legend.position="bottom"))
Mecklenburg.race <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
group_by(raceContext) %>%
summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
st_drop_geometry() %>%
spread(raceContext, meanMAPE) %>%
kable(caption = "Table 4.10.1\nTest set MAPE by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F)
Mecklenburg.race
| Majority Non-White | Majority White |
|---|---|
| 0.2852243 | 0.2372819 |
Mecklenburg.income <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
group_by(incomeContext) %>%
summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
st_drop_geometry() %>%
spread(incomeContext, meanMAPE) %>%
kable(caption = "Table 4.10.2\nTest set MAPE by Neighborhood Income Context") %>%
kable_styling("striped", full_width = F)
Mecklenburg.income
| High Income | Low Income |
|---|---|
| 0.2417418 | 0.2775655 |
Mecklenburg.vacant <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
group_by(vacantContext) %>%
summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
st_drop_geometry() %>%
spread(vacantContext, meanMAPE) %>%
kable(caption = "Table 4.10.3\nTest set MAPE by Neighborhood Vacant Context") %>%
kable_styling("striped", full_width = F)
Mecklenburg.vacant
| Majority Non-Vacant | Majority Vacant |
|---|---|
| 0.2332548 | 0.2602928 |
Mecklenburg.rent <- st_join(Mecklenburg.train, Mecklenburg.tracts20) %>%
group_by(pctRenterContext) %>%
summarize(meanMAPE = mean(Price.APE, na.rm = T)) %>%
st_drop_geometry() %>%
spread(pctRenterContext, meanMAPE) %>%
kable(caption = "Table 4.10.4\nTest set MAPE by Neighborhood Renter Occupied Context") %>%
kable_styling("striped", full_width = F)
Mecklenburg.rent
| Majority Non-Renter Occupied | Majority Renter Occupied |
|---|---|
| 0.2413728 | 0.2907168 |
In general, the model is effective in predicting house price. The results shows that the R-squared reached 0.7, which means the feature we selected in the model can explain 70% variation of house prices. However, for the generalizability, it performs not very well. The MAE is up to 104032.5 and MAPE is 0.2738515. It might because that we didn’t consider neighborhood effects and also some feature might fit to some region but not suitable for other places. Therefore, when using new data to test our model, it may not perform very well.
There are some interesting variables which can be considered useful by common sense but actually not. These features are like parks and employments.
We believe that the intrinsic characteristics of the house, such as heated area and number of bedrooms, generally correlate well with price. Then, we consider the properties of the house itself as important features.
The average difference between our price forecasts and actual prices is about 104,032.5. and most of the errors occur in the central and southern parts of Mecklenburg. In addition, the higher the actual price, the larger the error.
Prices in central Mecklenburg are relatively lower than in other regions, especially in the north and south. We speculate that this is related to race and income.
The model performs well in the marginal part of Mecklenburg, especially in the western part, and poorly in the middle part of Mecklenburg. The reason for this is that some features may be important for areas with large errors, but we did not include them in the model. Therefore, the model does not predict these areas well. As for those regions with smaller errors, the features we chose for the model may be just appropriate.
Based on our findings, we would recommend our model to Zillow, but I would definitely add some important caveats - our findings can only be used as an auxiliary reference, not as a key determinant, as it is poorly accurate. There are several areas that we need to improve in the future Zillow’s use:
We can use more advanced feature engineering methods to transform variables, such as logarithms or power functions, to make them perform better in OLS models.
Add some uncovered indicators to the model, especially those that help predict areas of high house prices.
Try other types of regression instead of OLS to reduce the error caused by non-linear variables. And if we can, we could apply Deep Learning method to fit our model, which will help a lot.
In this way, we believe our model can provide Zillow a better prediction of house price in the future.
[1]Wang Y. House-price Prediction Based on OLS Linear Regression and Random Forest[C]//2021 2nd Asia Service Sciences and Software Engineering Conference. 2021: 89-93.